safety training
- North America > United States > California > San Francisco County > San Francisco (0.05)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- North America > United States (1.00)
- Africa > South Africa (0.04)
- Europe > Latvia > Lubāna Municipality > Lubāna (0.04)
- (5 more...)
- Research Report > Experimental Study (1.00)
- Workflow (0.67)
- Research Report > New Finding (0.67)
- Media (1.00)
- Law > Civil Rights & Constitutional Law (1.00)
- Information Technology > Security & Privacy (1.00)
- (8 more...)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.92)
Detecting and Steering LLMs' Empathy in Action
We investigate empathy-in-action -- the willingness to sacrifice task efficiency to address human needs -- as a linear direction in LLM activation space. Using contrastive prompts grounded in the Empathy-in-Action (EIA) benchmark, we test detection and steering across Phi-3-mini-4k (3.8B), Qwen2.5-7B (safety-trained), and Dolphin-Llama-3.1-8B (uncensored). Detection: All models show AUROC 0.996-1.00 at optimal layers. Uncensored Dolphin matches safety-trained models, demonstrating empathy encoding emerges independent of safety training. Phi-3 probes correlate strongly with EIA behavioral scores (r=0.71, p<0.01). Cross-model probe agreement is limited (Qwen: r=-0.06, Dolphin: r=0.18), revealing architecture-specific implementations despite convergent detection. Steering: Qwen achieves 65.3% success with bidirectional control and coherence at extreme interventions. Phi-3 shows 61.7% success with similar coherence. Dolphin exhibits asymmetric steerability: 94.4% success for pro-empathy steering but catastrophic breakdown for anti-empathy (empty outputs, code artifacts). Implications: The detection-steering gap varies by model. Qwen and Phi-3 maintain bidirectional coherence; Dolphin shows robustness only for empathy enhancement. Safety training may affect steering robustness rather than preventing manipulation, though validation across more models is needed.
Detecting Sleeper Agents in Large Language Models via Semantic Drift Analysis
Zanbaghi, Shahin, Rostampour, Ryan, Abid, Farhan, Jarmakani, Salim Al
Large Language Models (LLMs) can be backdoored to exhibit malicious behavior under specific deployment conditions while appearing safe during training a phenomenon known as "sleeper agents." Recent work by Hubinger et al. demonstrated that these backdoors persist through safety training, yet no practical detection methods exist. We present a novel dual-method detection system combining semantic drift analysis with canary baseline comparison to identify backdoored LLMs in real-time. Our approach uses Sentence-BERT embeddings to measure semantic deviation from safe baselines, complemented by injected canary questions that monitor response consistency. Evaluated on the official Cadenza-Labs dolphin-llama3-8B sleeper agent model, our system achieves 92.5% accuracy with 100% precision (zero false positives) and 85% recall. The combined detection method operates in real-time (<1s per query), requires no model modification, and provides the first practical solution to LLM backdoor detection. Our work addresses a critical security gap in AI deployment and demonstrates that embedding-based detection can effectively identify deceptive model behavior without sacrificing deployment efficiency.
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.96)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)
- North America > United States (1.00)
- Africa > South Africa (0.04)
- Europe > Latvia > Lubāna Municipality > Lubāna (0.04)
- (5 more...)
- Research Report > Experimental Study (1.00)
- Workflow (0.67)
- Research Report > New Finding (0.67)
- Media (1.00)
- Law > Civil Rights & Constitutional Law (1.00)
- Information Technology > Security & Privacy (1.00)
- (9 more...)
- North America > United States > California > San Francisco County > San Francisco (0.05)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Teen killed himself after 'months of encouragement from ChatGPT', lawsuit claims
The makers of ChatGPT are changing the way it responds to users who show mental and emotional distress after legal action from the family of 16-year-old Adam Raine, who killed himself after months of conversations with the chatbot. Open AI admitted its systems could "fall short" and said it would install "stronger guardrails around sensitive content and risky behaviors" for users under 18. The 500bn ( 372bn) San Francisco AI company said it would also introduce parental controls to allow parents "options to gain more insight into, and shape, how their teens use ChatGPT", but has yet to provide details about how these would work. Adam, from California, killed himself in April after what his family's lawyer called "months of encouragement from ChatGPT". The teenager's family is suing Open AI and its chief executive and co-founder, Sam Altman, alleging that the version of ChatGPT at that time, known as 4o, was "rushed to market … despite clear safety issues".
- North America > United States > California > San Francisco County > San Francisco (0.27)
- Oceania > Australia (0.06)
- Europe > United Kingdom (0.06)
- Europe > Ireland (0.06)
Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation
Kumarage, Tharindu, Mehrabi, Ninareh, Ramakrishna, Anil, Zhao, Xinyan, Zemel, Richard, Chang, Kai-Wei, Galstyan, Aram, Gupta, Rahul, Peris, Charith
Safety reasoning is a recent paradigm where LLMs reason over safety policies before generating responses, thereby mitigating limitations in existing safety measures such as over-refusal and jailbreak vulnerabilities. However, implementing this paradigm is challenging due to the resource-intensive process of creating high-quality policy-embedded chain-of-thought (CoT) datasets while ensuring reasoning remains accurate and free from hallucinations or policy conflicts. To tackle this, we propose AIDSAFE: Agentic Iterative Deliberation for Safety Reasoning, a novel data generation recipe that leverages multi-agent deliberation to iteratively expand reasoning on safety policies. A data refiner stage in AIDSAFE ensures high-quality outputs by eliminating repetitive, redundant, and deceptive thoughts. AIDSAFE-generated CoTs provide a strong foundation for supervised fine-tuning (SFT)-based safety training. Additionally, to address the need of preference data in alignment stages, such as DPO training, we introduce a supplemental recipe that uses belief augmentation to create distinct selected and rejected CoT samples. Our evaluations demonstrate that AIDSAFE-generated CoTs achieve superior policy adherence and reasoning quality. Consequently, we show that fine-tuning open-source LLMs on these CoTs can significantly improve safety generalization and jailbreak robustness while maintaining acceptable utility and over-refusal accuracy. AIDSAFE-generated CoT datasets can be found here: https://huggingface.co/datasets/AmazonScience/AIDSAFE
- Law (1.00)
- Government (1.00)
- Education (0.67)
- Media (0.67)
A generative approach to LLM harmfulness detection with special red flag tokens
Xhonneux, Sophie, Dobre, David, Mofakhami, Mehrnaz, Schwinn, Leo, Gidel, Gauthier
Most safety training methods for large language models (LLMs) based on fine-tuning rely on dramatically changing the output distribution of the model when faced with a harmful request, shifting it from an unsafe answer to a refusal to respond. These methods inherently compromise model capabilities and might make auto-regressive models vulnerable to attacks that make likely an initial token of affirmative response. To avoid that, we propose to expand the model's vocabulary with a special token we call red flag token (
- North America > United States > California > Los Angeles County > Los Angeles (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
Does Safety Training of LLMs Generalize to Semantically Related Natural Prompts?
Addepalli, Sravanti, Varun, Yerram, Suggala, Arun, Shanmugam, Karthikeyan, Jain, Prateek
Large Language Models (LLMs) are known to be susceptible to crafted adversarial attacks or jailbreaks that lead to the generation of objectionable content despite being aligned to human preferences using safety fine-tuning methods. While the large dimensionality of input token space makes it inevitable to find adversarial prompts that can jailbreak these models, we aim to evaluate whether safety fine-tuned LLMs are safe against natural prompts which are semantically related to toxic seed prompts that elicit safe responses after alignment. We surprisingly find that popular aligned LLMs such as GPT-4 can be compromised using naive prompts that are NOT even crafted with an objective of jailbreaking the model. Furthermore, we empirically show that given a seed prompt that elicits a toxic response from an unaligned model, one can systematically generate several semantically related natural prompts that can jailbreak aligned LLMs. Towards this, we propose a method of Response Guided Question Augmentation (ReG-QA) to evaluate the generalization of safety aligned LLMs to natural prompts, that first generates several toxic answers given a seed question using an unaligned LLM (Q to A), and further leverages an LLM to generate questions that are likely to produce these answers (A to Q). We interestingly find that safety fine-tuned LLMs such as GPT-4o are vulnerable to producing natural jailbreak questions from unsafe content (without denial) and can thus be used for the latter (A to Q) step. We obtain attack success rates that are comparable to/ better than leading adversarial attack methods on the JailbreakBench leaderboard, while being significantly more stable against defenses such as Smooth-LLM and Synonym Substitution, which are effective against existing all attacks on the leaderboard.
- Information Technology > Security & Privacy (1.00)
- Government (1.00)